Before you begin this worksheet this week, please reinstall DigitalMethodsData from GitHub by running: devtools::install_github("regan008/DigitalMethodsData") in your console. Also be sure that you have installed the Tidyverse library.

R has powerful tools for manipulating data. The Tidyverse is a collection of packages for R that are designed for data science. Take a look at the website for the Tidyverse and the list of packages that are included at: https://www.tidyverse.org/packages/

A Grammar of Data Manipulation with dplyr()

We’ll start with dplyr which is described as “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” The verbs included in this package are:

All of these verbs play nicely and combine naturally with group_by() which allows you to perform any operation “by group”.

Lets load some data and libraries for our work.

library(DigitalMethodsData)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)
data("gayguides")

Select

Lets start with select(). This function allows you to subset columns using their names and types.

gayguides %>% 
  select(title, Year)

Notice that this subsetted the data and returned only the title and year. However, it didn’t modify the gayguides data or save it to a new variable because we didn’t assign the result to anything.

  1. Use select() to take the city and state from gayguides and add them to a dataframe called “locations”.
locations <- gayguides %>%
  select(city, state) %>%
  data.frame()
  1. What did you do to save the data to a new data frame? Why?

I assigned a new variable named “locations” to the result of the functions select() and data.frame(). Because if not, the result would not be saved into the Environment. The data.frame function created a new dataframe for the resulting selection of the seelct function.

  1. Can you use select() to grab all the columns of gayguides EXCEPT for the city and state? Hint: You might want to read the documentation for this function.
gayguides %>%
  select (!city & !state)

Filter

The filter function subsets a data frame and retains all the rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all of the conditions you provide.

gayguides %>% filter(Year > 1980)

Filter also works with the logical values we learned earlier this semester.

gayguides %>% filter(Year == 1970 | Year == 1980)

And strings:

gayguides %>% 
  filter(city == "Greenville")
  1. The above code grabs every location where the city is Greenville. However, there is more than one city named Greenville. Can you filter to retrieve Greenville, SC?
gayguides %>%
  filter(city == 'Greenville' & state== 'SC')
  1. How about every location between 1975 and 1980?
gayguides %>%
  filter(Year > 1975 & Year < 1980)
  1. Every city in Greenville, SC between 1975 and 1980?

(did you mean every “entry”?)

gayguides %>%
  filter(city == 'Greenville' & state == 'SC') %>%
  filter(Year > 1975 & Year < 1980)
  1. Can you find all locations in 1975 except for New York and San Francisco?
test.1975NOTnycsf <- gayguides %>%
  filter(Year == 1975 & city != 'New York') %>%
  filter (city != 'San Francisco')
  1. The amenity features column in gay guides contains a comma separated list of categorizations. (G), for example, stands for girls. However, this language changed over time and women’s locations eventually are described as (L). What if we want to filter by any entry that has (G) OR (L) in the amenity feature column? This is a bit more complicated because the entries are a comma separated list and (G) or (L) is often paired with other categorizations. How might you search the dataframe for entries that match (G) or (L)?
filter(gayguides, grepl('(G)', amenityfeatures))
filter(gayguides, grepl('(L)', amenityfeatures))

I could only figure out how to use grepl() to get two different dataframes, one for (G) and other for (L). I couldn’t get both results in one single dataframe.

Mutate

The mutate() function adds new variables and preserves existing one. This is useful when you want to create a new column based on other values. For example, in the statepopulation dataset, we want to ask “How much did the population increase between 1800 and 1900 in each state?.” We can do that by subtracting the population in 1900 from 1800 and storing that value in a new column.

data("statepopulations")
statepopulations %>% mutate(difference = X1900 - X1800) 
  1. In the Boston Women Voters dataset, every voter is given an age. Can you use their age to calculate each person’s birth year? (Assume all this data was collected in 1920.)
data("BostonWomenVoters")
BostonWomenVoters %>%
  mutate(birth.year = 1920 - Age)
  1. Can you create a new column that combines the city and state columns in gayguides into a new column called location? It should list the city, state. (i.e. San Diego, CA)
gayguides %>%
  mutate(location = paste(city, state, sep = ", "))

Arrange

Arrange() orders the rows of a data frame by the values of selected columns. In other words it sorts a data frame by a variable. In the gayguides data, we can sort the data by year with the earliest year first. If we wanted the latest year first, we could do so by using the desc() function.

gayguides %>%
  arrange(Year)
gayguides %>%
  arrange(desc(Year))
  1. Using the statepopulation data, which state has the largest population in 1850? Write code that pulls only the relevant columns (state and 1850) and sorts it accordingly.
statepopulations %>%
  arrange(desc(X1850)) %>%
  select(STATE, X1850)

The state of New York had the largest population in 1850.

Group_by() and Summarize()

Arrange is useful for finding the highest and lowest values, but it returns those values for the entire dataset. group_by(), in contrast, takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. Lets look at what that means in practice:

mydata <- gayguides %>% 
            select(title, Year) %>%
            group_by(Year)

It doesn’t appear that this did much. But if you hover over this new variable in your environment pane, you’ll see that its now listed as a “grouped data frame.” Compare that to gayguides which is listed as just a data frame. This means that now we can run calculations on this data and it’ll perform them “by group”. Or, in other words, it’ll perform operations on each year within the dataset. That’s where summarize() comes in. summarize() creates a new data frame with one (or more) rows for each combination of grouping variables. In this case our grouping is by year, so the resulting data frame will group records by each year in the gayguides dataset.

gayguides %>% 
    select(title, Year) %>%
    group_by(Year) %>%
    summarize(count = n())

What happened here? In this example, we asked group_by to create groups based on year and then in summarize we created a column called count. We passed it the n() function which gives the current group size. What results, is a dataset that lists each year and how many locations that state has.

  1. You try, use group_by and summarize to find the total number of locations in each state, each year.
gayguides %>%
  select(state, Year) %>%
  group_by(state, Year) %>%
  summarize(count.locations = n())
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
  1. Summarize can do more than just count rows. Can you use summarize() to find the average age for each occupation in the Boston Women Voters data?
BostonWomenVoters %>%
  select(Age, Occupation) %>%
  group_by(Occupation) %>%
  summarize(average.age = median(Age))
  1. In the gayguides data, on average how many locations did each city in South Carolina have between 1970 and 1975?
gayguides %>%
  select(city, state, Year) %>%
    filter(state == 'SC' & Year > 1970 & Year < 1975) %>%
  group_by(city, Year) %>%
  summarise (count = n()) %>%
  summarise(average.locations = median(count))
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.

I think this isn’t right. The median for Charleston, for example, should be 4.5. And for Myrtle Beach should be 1.75. I think it is rounding up, but I’m not sure how to show the decimal values.

  1. Filter the dataset for only the values in the southernstates list (created in the block below). Then tell me, how many locations were in all the southern states in 1975?
southernstates <- c("AL", "AR", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "SC", "TN", "TX", "VI", "WV")
gayguides %>%
  select(state, Year) %>%
  filter(Year == 1975 & state == southernstates) %>%
  group_by(state) %>%
  summarise(locations = n())
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `Year == 1975 & state == southernstates`.
## Caused by warning in `state == southernstates`:
## ! longer object length is not a multiple of shorter object length

Re-Shaping Data: Joins and Pivots

Joins()

At some point, you might have a situation where you want to join two tables together. For example, in the almshouse_admissions dataset there is a column called “Occupational Code”.

data("almshouse_admissions")
head(almshouse_admissions$Descriptions.by.Clerk)
## [1] 10 50  6  3  6  0

For the purposes of working with this data in R, having only the code isn’t very useful. The code book for these occupations is available here:

almshouse.occupations <- read.csv(file="https://raw.githubusercontent.com/regan008/DigitalMethodsData/main/raw/almshouse-occupationalcodes.csv", header=TRUE)

A join allows us to join these two dataframes together, matching each row based on the occupational code provided in the Descriptions.by.Clerk column. To do that we’ll use a function known as a mutating join. A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other. In this case we want to join the matching rows from almshouse.occupations to almshouse_admissions. In an ideal world, the column names in the two data frames would match but since that isn’t the case, we’ll have to specify what columns left_join should use to join the two data frames.

almshouse_admissions <- left_join(almshouse_admissions, almshouse.occupations, by=c("Descriptions.by.Clerk"="code"))

head(almshouse_admissions)
  1. Below I’ve downloaded data about each of the census regions. Join this dataset with gayguides. Create a data frame that includes each of the regions and the total number of locations in 1980. How many locations appear in the Mountain region in 1980?
regions <- read.csv("https://raw.githubusercontent.com/regan008/DigitalMethodsData/main/raw/censusregions.csv")
gayguides_regions <- left_join(gayguides, regions, by=c("state"="State.Code")) 

gayguides_regions %>%
  select(title, Year, Division) %>%
  filter(Year == 1980) %>%
  group_by(Division) %>%
  summarise(locations = n())
  1. Explain what you did above. What variable did you join by and why? What results?

There are 285 locations in the Mountain region in 1980.First, I joined the state variable in gayguides with the State.Code variable in the regions dataset since they had the same values. I saved the output as a new dataframe called gayguides_regions. Then I passed it the filter function to show only the locations in 1980. With that, I grouped the result by Division (or region) and summarised the total of observations per region in 1980. There were NA values in the joined dataset because of inconsistencies in how the states were transcribed into the gayguides data, i.e., ‘DC’, ‘D.C.’, etc.

(@)How much did LGTBQ life grow between 1970 and 1980? Can you create a data frame that computes the growth in the number of locations between 1970 and 1980 for every state? For every region?

gayguides_stategrowth_1970.1980 <- gayguides_regions %>%
  select(state, Year) %>%
  filter (Year == 1970 | Year == 1980) %>%
  group_by(state, Year) %>%
  summarise (count = n())
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
stategrowth.calculated <- gayguides_stategrowth_1970.1980 %>%
  select(state, count) %>%
  group_by(state) %>%
  summarise(growth = max(count) - min(count))

gayguides_regiongrowth_1970.1980 <- gayguides_regions %>%
  select(Division, Year) %>%
  filter (Year == 1970 | Year == 1980) %>%
  group_by(Division, Year) %>%
  summarise (count = n())
## `summarise()` has grouped output by 'Division'. You can override using the
## `.groups` argument.
regiongrowth.calculated <- gayguides_regiongrowth_1970.1980 %>%
  select(Division, count) %>%
  group_by(Division) %>%
  summarise(growth = max(count) - min(count))

This last one was a RIDE.

pivot_longer() and pivot_wider(): Converting Wide and Long Data

It’s possible that you won’t create every dataset you use in R. Sometimes that means the dataset is in a format that isn’t useful for the questions you want to ask. The dataset below is what is referred to as a “wide” data frame. That is in comparison to a “long” data frame (which would be considered tidy data).

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
sc.parks <- read.csv("https://raw.githubusercontent.com/regan008/DigitalMethodsData/main/raw/RecreationData-Wide.csv")
head(sc.parks)

This dataset contains all of the localities in South Carolina along with information about the types of recreational workers in that city (paid vs unpaid, male vs female). However, the problem with this dataset is that every year is a column heading making it difficult to work with. On the surface this seems like a useful format, partially because it reads left to right which is how we’re accustomed to reading documents. Its easy to compare, for example, the number of female paid recreation workers between 1930 and 1945. But for computational purposes this format is less than ideal for many types of visualizations and operations. R provides functions for dealing with this. pivot_longer() “lengthens” your data by increasing the number of rows and decreasing the number of columns.

sc.parks <- sc.parks %>%
  pivot_longer(!city:type_of_worker, names_to = "year", values_to = "count")
  1. What did this code do? > It “transfered” the column headers to a new column called “year” and the values of the columns to a new column called “count”. It also asked the program to preserve the columns between “city” and “type_of_worker”.

  2. Here’s another wide data frame. Can you turn this from a wide to a narrow data frame?

rec.spaces <- read.csv("https://raw.githubusercontent.com/regan008/DigitalMethodsData/main/raw/PlayAreabyType.csv")

rec.spaces <- rec.spaces %>%
  pivot_longer(!type, names_to = "year", values_to = "count")

The opposite of pivot_longer() is pivot_wider(). It “widens” data by increasing the number of columns and decreasing the number of rows. We can revert sc.parks back to a wide dataset using this function.

sc.parks %>%
  pivot_wider(names_from = year, values_from = count)
  1. Widen the sc.parks dataset so that the column names are drawn from the type of recreation worker.
sc.parks <- sc.parks %>%
  pivot_wider(names_from = type_of_worker, values_from = count)
  1. Turn rec.spaces into a wide dataframe.
rec.spaces <- rec.spaces %>%
  pivot_wider(names_from = type, values_from = count)

Putting it all together

Each of the functions covered in this worksheet are valuable tools for manipulating datasets. But they are more powerful when combined. When using them to pair down a dataset, we are asking and answering a question. For example in this code from earlier in our worksheet:

gayguides %>% 
    select(title, Year) %>%
    group_by(Year) %>%
    summarize(count = n())

The implicit question was, “How many locations appear in each year?”. The judges dataset in provided in the DigitalMethodsData package is a large, messy, wide dataframe that contains a lot of information. Look at this dataframe and then compose a question to ask of the data.

  1. First, tell me, what is the question you are asking?
data("judges")

How many nonwhite individuals occupied the seat of judge across the board?

  1. Now write some code to address that question. Comment the code with notes that explain your thinking as you go. Use functions like select(), filter(), etc to pair down your dataset and reshape it to address your question.
judges.nonwhite <- judges %>%
  select(Last.Name, Gender, Race.or.Ethnicity, ABA.Rating..1., ABA.Rating..2., ABA.Rating..3., ABA.Rating..4., ABA.Rating..5., ABA.Rating..6.) %>%
  filter(Race.or.Ethnicity != 'White') %>%
  group_by(Gender, Race.or.Ethnicity) %>%
  summarise(count = n())
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
#I've tried different things before reaching this final piece of code. But the judges dataset is so messy that I could barely ask any questions associated with time. There were also limited results when filtering by ethnicity or gender (not surprisingly). So I've decided to filter out every 'White' judge and then just group the filtered results by gender and race.
judges.nonwhite <- judges.nonwhite %>%
  pivot_wider(names_from = Race.or.Ethnicity, values_from = count)
# Now I've decided to pass it the function pivot_wider() to create a column for each ethnicity and leaving genders male and female as the only two observations. This makes it easier to answer the question of how many nonwhite individuals occupied the title of judge, but also expands the possible question to "how many of them were men and how many were female? And how were they classified in regard to race and ethnicity?"
  1. Now ask a question of the gayguides data (or another dataset of your choice). What is the question you are asking?

How many locations in 1975 were NOT classified as bars or clubs? How many of those were classified as Religious Institutions? And in which region of the country were religious institutions more concentrated?

  1. Now write some code to address that question. Comment the code with notes that explain your thinking as you go. Use functions like select(), filter(), etc to pair down your dataset and reshape it to address your question.
#first I'll subset the gayguides_regions dataset to get all the locations not classified as bars or clubs  in 1975.
gayguides.NOT.bars <- gayguides_regions %>%
  select(title, Year, type, Region) %>%
  filter(Year == 1975) %>%
  filter(type != 'Bars/Clubs')
#now I'll subset the gayguides.NOT.bars dataset to get all of the locations classified as religious.
gayguides.religious <- gayguides.NOT.bars %>%
  filter(grepl('Religious', type))
#lastly, I'll group the data by region and summarise the number of locations for each region.
gayguides.religious %>%
  group_by(Region) %>%
  summarise(locations = n())
  1. Write a function that filters the gay guides dataset. It should accept 2 arguments: year and state. When passed to the function the function should return only the title, type, state and year for each entry.
gayguides %>%
  select(title, type, state, Year) %>%
  filter(Year == 1978 & state == 'CA')